Robust 3D Pose Estimation and Efficient 2D Region-Based Segmentation from a 3D Shape Prior
Authors: Samuel Dambreville, Romeil Sandhu, Anthony Yezzi, and Allen Tannenbaum
Abstract
In this work, we present an approach to jointly segment a rigid object in a 2D image and estimate its 3D pose, using the knowledge of a 3D model. We naturally couple the two processes together into a unique energy functional that is minimized through a variational approach. Our methodology differs from standard monocular 3D pose estimation algorithms since it does not rely on local image features. Instead, we use global image statistics to drive the pose estimation process. This confers a satisfying level of robustness to noise and initialization for our algorithm, and bypasses the need to establish correspondences between image and object features. Moreover, our methodology possesses the typical qualities of region-based active contour techniques with shape priors, such as robustness to occlusions or missing information, without the need to evolve an infinite dimensional curve. Another novelty of the proposed contribution is the use of a unique 3D model surface of the object, instead of learning a large collection of 2D shapes to accommodate the diverse aspects that a 3D object can take when imaged by a camera. Experimental results on both synthetic and real images are provided, which highlight the robust performance of the technique on challenging tracking and segmentation applications.

1 Motivation and Related Work

2D image segmentation and 2D-3D pose estimation are ubiquitous tasks in computer vision applications and have received a great deal of attention in the past few years. These two fundamental techniques are usually studied separately in the literature. In this work, we combine both approaches in a variational framework. To appreciate the contribution of this work, we recall some of the results and specifics of both fields.

2D-3D pose estimation aims at determining the pose of a 3D object relative to a calibrated camera from a single 2D image or a collection of 2D images. By knowing the mapping between world coordinates and image coordinates from the camera calibration matrix, and after establishing correspondences between 2D features in the image and their 3D counterparts on the model, it is then possible to solve for the pose transformation (from a set of equations that express these correspondences). The literature concerned with 3D pose estimation is very large and a complete survey is beyond the scope of this paper. However, most methods can be distinguished by the type of local image features used to establish correspondences, such as points [1], lines or segments [2, 3], multi-part curve segments [4], or complete contours [5, 6].

Segmentation consists of separating an object from the background in an image. The geometric active contour (GAC) framework, in which a curve is evolved continuously to capture the boundaries of an object, has proven to be quite successful at performing this task. Originally, the method focused on extracting local image features such as edges to perform segmentation; see [7, 8] and the references therein. However, edge-based techniques can suffer from the typical drawbacks that arise from using local image features: high sensitivity to noise or missing information, and a multitude of local minima that result in poor segmentations. Region-based approaches, which use global image statistics inside and outside the contour, were shown to drastically improve the robustness of segmentation results [9–12]. A minimal sketch of such a region-based energy is given below.
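To make the region-based idea concrete, the following sketch evaluates a piecewise-constant two-region energy (in the spirit of the mean-intensity model of [10]) for a given binary labeling of the image. It is only an illustration of the general principle, not the exact functional used in this paper; the function name and interface are chosen here for exposition.

import numpy as np

def region_energy(image, mask):
    """Piecewise-constant (Chan-Vese-type) region energy.

    image : 2D array of grayscale intensities
    mask  : boolean array of the same shape, True for pixels labeled as object

    The energy is low when the object and background regions are each
    well explained by a single mean intensity.
    """
    inside = image[mask]
    outside = image[~mask]
    mu_in = inside.mean() if inside.size else 0.0
    mu_out = outside.mean() if outside.size else 0.0
    return np.sum((inside - mu_in) ** 2) + np.sum((outside - mu_out) ** 2)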
Region-based techniques are able to deal with various statistics of the object and background, such as distinct mean intensities [10], Gaussian distributions [11, 12] or intensity histograms [13, 14], as well as a wide variety of photometric descriptors such as grayscale values, color or texture [15]. A further improvement of the GAC approach consists of learning the shape of objects and constraining the contour evolution to adopt familiar shapes, to make up for the poor segmentation results obtained in the presence of noise, clutter, occlusion, or when the statistics of the object and background are difficult to distinguish (see, e.g., [16–19]).

Motivation/Contribution: Our goal is to combine the strengths of both techniques and to avoid some of their typical weaknesses, in order to robustly segment 2D images and estimate the pose of an arbitrary 3D object whose shape is known. In particular, we use a region-based approach to continuously drive the pose estimation process. This global approach avoids using local image features and, hence, addresses two shortcomings that typically arise from doing so in many 2D-3D pose estimation algorithms. Firstly, finding the correspondence between local features in the image and on the model is a non-trivial task, due for instance to their viewpoint dependency; no local correspondences need to be found in our global approach. Secondly, local image features may not even exist, or can be difficult to detect in a reliable and robust fashion in the presence of noise, clutter or occlusion. Furthermore, simplifying assumptions usually need to be made on the class of shapes that a 2D-3D pose estimation technique can handle. Many approaches are limited to simple shapes that can be described using geometric primitives such as corners, lines, circles or cylinders. Recent work has focused on free-form objects that admit a manageable parametric description, as in [5]. However, even this type of algebraic approach can become unmanageable for objects of arbitrary and complex shape. Our approach can deal with rigid objects of arbitrary shape, represented by a 3D level set [20] or a 3D cloud of points (Figure 1). Conversely, a shortcoming of the GAC framework using shape priors is that 2D shapes are usually learned to segment 2D images. Hence, a large collection of 2D shapes needs to be learned to represent the wide variation in aspect that most natural 3D objects take when projected onto the 2D image plane. Our region-based approach benefits from the knowledge of the object shape, which is compactly described by a unique 3D model. Acquisition of 3D models can be readily accomplished using range scans [21] or structure from motion approaches [22], notably. In addition, and in contrast to the GAC framework, the proposed method does not involve the evolution of an infinite dimensional contour to perform segmentation, but only solves for the finite dimensional pose parameters (as is common for 2D-3D pose estimation approaches). This results in a much simplified framework that avoids dealing with problems such as infinite dimensional curve representation, evolution and regularization.

Relation to Previous Work: In this paper, we exploit many ideas from recent variational approaches that address the problem of structure from motion and stereo reconstruction from multiple cameras ([23, 22] or [24]). Originally, the authors in [23, 22] presented a method to reconstruct the 3D shape of an object from multiple 2D views obtained from calibrated cameras. (The camera projection model common to that setting and ours is sketched below.)
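As a point of reference for what follows, the sketch below projects a 3D point-cloud model into the image plane of a calibrated camera under a rigid pose. It is a generic illustration of the pinhole model assumed throughout; the axis-angle rotation parameterization (via Rodrigues' formula) and the function names are choices made here for exposition, not taken from the paper.

import numpy as np

def rotation_from_axis_angle(w):
    """Rodrigues' formula: 3-vector w (axis * angle) -> 3x3 rotation matrix."""
    theta = np.linalg.norm(w)
    if theta < 1e-12:
        return np.eye(3)
    k = w / theta
    K = np.array([[0, -k[2], k[1]],
                  [k[2], 0, -k[0]],
                  [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)

def project_points(X, w, t, K):
    """Project Nx3 model points X under pose (w, t) with calibration matrix K.

    Returns Nx2 pixel coordinates (pinhole model; points are assumed to lie
    in front of the camera).
    """
    R = rotation_from_axis_angle(w)
    Xc = X @ R.T + t              # model coordinates -> camera coordinates
    x = Xc @ K.T                  # apply the intrinsic calibration
    return x[:, :2] / x[:, 2:3]   # perspective division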
The present contribution aims at performing a somewhat opposite task: given the 3D model of an object, perform the segmentation of 2D images and recover the 3D pose of the object relative to a unique camera. This is the first time that the framework in [23, 22] is adapted and employed in the specific context of segmenting 2D images from a unique camera, using the knowledge of a 3D model. The framework in [22] has also recently been extended in [25] to address the problem of multiple camera calibration. In the present work, the camera is assumed to be calibrated. However, this assumption could easily be dropped by also solving for the optimal camera calibration parameters, as presented in [25].

We note that, although the use of 3D shape knowledge to perform the 2D segmentation of regions presents obvious advantages, the literature dealing with this type of approach is strikingly thin. The pieces of work closest to the proposed contribution are probably [26] and [27]. In [26], the authors evolve an (infinite dimensional) active contour as well as 3D pose parameters to minimize a joint energy functional encoding both image information and 3D shape knowledge. Our method differs from the aforementioned approach in many crucial aspects: we optimize a unique energy functional, which allows us to circumvent the need to determine ICP-like correspondences and to perform costly back-projections between the segmenting contour and the shape model at each iteration. Also, we perform optimization only in the finite dimensional space of the Euclidean pose parameters. In addition to being computationally efficient, this makes our technique less likely to be trapped in local minima, resulting in robust performance, as demonstrated in the experimental section. In [27], the method of [26] is successfully simplified by performing energy minimization only in the space of 3D pose parameters. Thus, the method of [27] and our contribution present some similarities. However, the energy minimization method and resulting algorithms are radically different: in [27], an algebraic approach is used that involves establishing correspondences and back-projections between the 3D and 2D worlds, as well as linearizing the resulting system of equations. Consequently, important information about the geometry of the 3D model is lost through the algebraic approach. In contrast, our approach relies on surface differential geometry (see, e.g., [28]) to link geometric properties of the model surface and its projection in the image domain. This allows us to derive the partial differential equations necessary to perform energy optimization, as well as to exploit the knowledge of the 3D object to its full extent. (A schematic illustration of optimizing a region-based energy over the finite dimensional pose parameters is sketched below.)

Our technique uses a 3D shape prior in a region-based framework, and can thereby be expected to be robust to noise or occlusion. Hence, an obvious application of the proposed approach is the robust tracking of 3D rigid objects in 2D image sequences. Our approach is, thus, also related to a wealth of methods concerned with the problem of model-based monocular tracking (see [29] for a recent survey).

Fig. 1. Left: Schema summarizing our segmentation/pose estimation approach from a 3D model, in 4 steps. Right: Different views of the 3D models used (rendered surfaces or clouds of points).
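To fix ideas, the sketch below ties the two previous snippets together: the model point cloud is projected under a candidate pose, the projection is rasterized into a crude silhouette mask, and the resulting region-based energy is decreased by gradient descent over the six pose parameters. The gradient is taken by finite differences purely to keep the sketch short; the paper instead derives analytic gradients through surface differential geometry, and the rasterized point mask is only a stand-in for the projection of the model surface. The code reuses region_energy and project_points from the earlier sketches, and all names, step sizes and the finite-difference scheme are illustrative choices, not the authors' implementation.

import numpy as np

def silhouette_mask(X, w, t, K, shape):
    """Crude silhouette: rasterize the projected model points into a binary mask.

    Assumes a dense point cloud so that the marked pixels roughly cover the
    projected region of the object.
    """
    px = np.round(project_points(X, w, t, K)).astype(int)
    mask = np.zeros(shape, dtype=bool)
    valid = (px[:, 0] >= 0) & (px[:, 0] < shape[1]) & \
            (px[:, 1] >= 0) & (px[:, 1] < shape[0])
    mask[px[valid, 1], px[valid, 0]] = True
    return mask

def pose_energy(p, X, K, image):
    """Region-based energy as a function of the 6 pose parameters p = (w, t)."""
    return region_energy(image, silhouette_mask(X, p[:3], p[3:], K, image.shape))

def refine_pose(p0, X, K, image, step=0.05, eps=1e-2, n_iter=100):
    """Finite-difference gradient descent on the six pose parameters.

    eps must be large enough that perturbing the pose actually moves the
    projected silhouette by at least a pixel; the analytic gradients derived
    in the paper via surface differential geometry avoid this issue.
    """
    p = np.asarray(p0, dtype=float)
    for _ in range(n_iter):
        grad = np.zeros(6)
        for i in range(6):
            dp = np.zeros(6)
            dp[i] = eps
            grad[i] = (pose_energy(p + dp, X, K, image) -
                       pose_energy(p - dp, X, K, image)) / (2 * eps)
        norm = np.linalg.norm(grad)
        if norm < 1e-12:
            break
        p -= step * grad / norm   # normalized step; step sizes need tuning in practice
    return p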
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید